 unification score


Beyond the Rosetta Stone: Unification Forces in Generalization Dynamics

arXiv.org Artificial Intelligence

Large language models (LLMs) struggle with cross-lingual knowledge transfer: they hallucinate when asked in one language about facts that were expressed in a different language during training. This work introduces a controlled setting to study the causes and dynamics of this phenomenon by training small Transformer models from scratch on synthetic multilingual datasets. We identify a learning phase wherein a model develops either separate or unified representations of the same facts across languages, and show that unification is essential for cross-lingual transfer. We also show that the degree of unification depends on the mutual information between facts and training data language, and on how easy it is to extract that language. Based on these insights, we develop methods to modulate the level of cross-lingual transfer by manipulating the data distribution and tokenization, and we introduce metrics and visualizations to formally characterize their effects on unification. Our work shows how controlled settings can shed light on pre-training dynamics and suggests new directions for improving cross-lingual transfer in LLMs.
Hallucination has been attributed to training and sampling noise, gaps in pretraining data (Xu et al., 2024), and misaligned incentives in post-training (Schulman, 2023). However, these explanations fail to account for cross-lingual factual errors: cases where models accurately answer questions posed in the same language as the training data, yet hallucinate when prompted in a different (often lower-resource) language (Goldman et al., 2025). Failures of cross-lingual transfer exacerbate disadvantages faced by speakers of underrepresented languages, and increasing model scale does not solve the problem (Aggarwal et al., 2025; Qi et al., 2023).
LLMs have been found to develop both a lingua franca for factual knowledge (typically based on English) and distinct language silos (Aggarwal et al., 2025; Lim et al., 2025b; Schut et al., 2025; Lim et al., 2025a, inter alia), and their hidden representations can be language-agnostic or language-specific depending on the layer (Wang et al., 2025). However, the root cause of these phenomena is not understood, as most research on cross-lingual transfer analyzes models as static artifacts. Such analysis, while valuable, cannot explain how knowledge arises during training, and therefore cannot lead to effective pre-training interventions. While some studies have investigated the training dynamics of knowledge acquisition in multilingual LLMs (Zeng et al., 2025; Liu et al., 2025), their approach is non-interventional and does not establish a causal link between data properties and cross-lingual transfer.
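The abstract above ties the degree of unification to the mutual information between facts and training data language. As a rough illustration of that quantity only, the sketch below estimates I(F; L) in bits from a list of (fact, language) samples. The function name `mutual_information` and the toy datasets are assumptions made here for illustration; they are not taken from the paper, and this plug-in estimator is not claimed to be the authors' metric.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Plug-in estimate of I(F; L) in bits from (fact, language) samples.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    n = len(pairs)
    joint = Counter(pairs)                      # counts of (fact, language)
    facts = Counter(f for f, _ in pairs)        # marginal counts of facts
    langs = Counter(l for _, l in pairs)        # marginal counts of languages
    mi = 0.0
    for (f, l), c in joint.items():
        p_fl = c / n
        # p(f, l) / (p(f) * p(l)) simplifies to c * n / (count(f) * count(l))
        mi += p_fl * math.log2(c * n / (facts[f] * langs[l]))
    return mi


# Toy data (assumed): every fact appears in both languages.
shared = [(f, lang) for f in range(4) for lang in ("en", "fr")]
# Toy data (assumed): each fact appears in exactly one language.
siloed = [(0, "en"), (1, "en"), (2, "fr"), (3, "fr")]

print(mutual_information(shared))
print(mutual_information(siloed))
```

In the first toy dataset the language carries no information about the fact, so the estimate is zero; in the second, knowing the fact fully determines the language, so the estimate is one bit.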


Unification-based Reconstruction of Explanations for Science Questions

arXiv.org Artificial Intelligence

The paper presents a framework to reconstruct explanations for multiple-choice science questions through explanation-centred corpora. Building upon the notion of unification in science, the framework ranks explanatory facts with respect to a question and candidate answer by combining two scores: (a) a Relevance Score (RS) that represents the extent to which a given fact is specific to the question; (b) a Unification Score (US) that captures the explanatory power of a fact, determined by its frequency in explanations for similar questions. An extensive evaluation of the framework is performed on the Worldtree corpus, adopting IR weighting schemes for its implementation. The following findings are presented: (1) the proposed approach achieves competitive results when compared to state-of-the-art Transformers, while remaining scalable to large explanatory knowledge bases; (2) the combined model significantly outperforms IR baselines (+7.8/8.4 MAP), confirming the complementary aspects of the relevance and unification scores; (3) the constructed explanations can support downstream models for answer prediction, improving the accuracy of BERT for multiple-choice QA on both ARC Easy (+6.92%) and Challenge (+15.69%) questions.
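The ranking idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: Jaccard token overlap stands in for the IR weighting schemes the abstract mentions, the unification score is taken as a similarity-weighted count over the `k` most similar explained questions, and the linear mixing weight `lam` is hypothetical. All function names and the toy data are made up for this example.

```python
def tokens(text):
    """Lowercased whitespace tokens as a set (a stand-in for IR term weighting)."""
    return set(text.lower().split())


def similarity(a, b):
    """Jaccard overlap between two texts; assumed here in place of TF-IDF or BM25."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def relevance_score(fact, query):
    """How specific the fact is to the question and candidate answer."""
    return similarity(fact, query)


def unification_score(fact, query, explained, k=2):
    """Similarity-weighted frequency of the fact in explanations of the
    k explained questions most similar to the query."""
    nearest = sorted(explained, key=lambda qe: similarity(query, qe[0]),
                     reverse=True)[:k]
    return sum(similarity(query, q) for q, facts in nearest if fact in facts)


def rank_facts(facts, query, explained, lam=0.5, k=2):
    """Rank facts by a linear mix of relevance and unification (lam is assumed)."""
    scored = [(lam * relevance_score(f, query)
               + (1 - lam) * unification_score(f, query, explained, k), f)
              for f in facts]
    return [f for _, f in sorted(scored, key=lambda s: s[0], reverse=True)]


# Toy knowledge base and explained questions (assumed for illustration).
A = "friction produces heat"
B = "a stick is a kind of object"
C = "the sun is a star"
explained = [
    ("rubbing hands together produces heat", {A}),
    ("rubbing sandpaper on wood produces heat", {A}),
    ("which star is closest to earth", {C}),
]
query = "rubbing two sticks together produces heat because of friction"

ranked = rank_facts([A, B, C], query, explained)
print(ranked)
```

In this toy case the fact about friction ranks first because it both overlaps with the query and recurs in the explanations of the two most similar questions, which is the complementary effect the abstract attributes to the two scores.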